

# RTL Synthesis: From Logic Synthesis to Automatic Pipelining

The authors review the evolution of RTL synthesis from the early techniques for combinational logic synthesis. Recent methods for automatic pipelining based on elastic timing are then introduced and proposed for microarchitectural exploration.

By Jordi Cortadella, Fellow IEEE, Marc Galceran-Oms,
Mike Kishinevsky, Senior Member IEEE, and Sachin S. Sapatnekar, Fellow IEEE

ABSTRACT | Design automation has been one of the main propellers of the semiconductor industry with logic synthesis being one of the core technologies in this field. This article reviews the evolution of logic synthesis until the advent of techniques for automatic pipelining based on elastic timing, either synchronous or asynchronous. The emergence of these techniques can enable a productive interaction with tools that can do microarchitectural exploration of complex designs.

**KEYWORDS** | Design automation; logic synthesis; high-level synthesis; architectural pipelining; timing elasticity

### I. INTRODUCTION

Since the early days of manual layout, electronic design automation (EDA) has witnessed a proliferation of abstract models, algorithms, description languages, and design frameworks that have allowed to scale productivity. After the introduction of the first very large scale integration (VLSI) layout tools, designers started working with objects at a higher level of abstraction, logic gates, and logic synthesis emerged as one of the main engines to

satisfy the competitive requirements of quality and short time-to-market.

Most of the innovation in logic synthesis has taken place in academia, where University of California, Berkeley (UC Berkeley) has been one of the leading actors. Since the early 1980s, logic synthesis has evolved to accommodate the increasing complexity of circuits. Today, we have specification models at the level of algorithms and transactions that enable designers to specify behaviors at abstraction levels similar to those used in software.

In this long journey, timing has always been one of the essential design parameters under optimization and a variety of timing models have been proposed to devise algorithms that can deal with performance and interfacing constraints. We can model the execution time (E) of an algorithm running for a specific data set by the following *Performance Equation*:

$$E = N \cdot P \tag{1}$$

Manuscript received September 15, 2014; revised April 11, 2015; accepted July 6, 2015. Date of publication September 24, 2015; date of current version October 26, 2015. This work has been partially supported by funds from the Spanish Ministry for Economy and Competitiveness and the European Union (FEDER funds) under grant TIN2013-46181-C2-1-R and a Fulbright award.

J. Cortadella is with the Universitat Politècnica de Catalunya, 08034 Barcelona, Spain (e-mail: jordi.cortadella@upc.edu).

**M. Galceran-Oms** is with eSilicon, Llull 321, 08019 Barcelona, Spain (e-mail: mgalceran@esilicon.com).

**M. Kishinevsky** is with Intel Corporation, Hillsboro, OR 97006 USA (e-mail: Michael.Kishinevsky@intel.com).

**S. S. Sapatnekar** is with the University of Minnesota, Minneapolis, MN 55455 USA (e-mail: sachin@umn.edu).

Digital Object Identifier: 10.1109/JPROC.2015.2456189

where N is the number of cycles required to execute the task and P is the cycle period.

In combinational logic synthesis [3], [4], timing optimizations are directed to shorten the critical paths in the combinational regions and reduce *P* without modifying the sequential elements (latches and flip-flops) of the circuit. Sequential optimizations, such as retiming [34], permit the modification of the internal sequential behavior of the circuit and give more opportunities for timing optimizations. Still, all the optimizations preserve the

external cycle-level behavior (N is not modified), thus avoiding any re-synthesis of the interfaces.

In the last decade, high-level synthesis (HLS) [17] has evolved as a competitive alternative to manual register transfer level (RTL). Among different techniques for scheduling operations under performance and resource constraints, several forms of loop pipelining have been proposed to reduce N through the parallelization of operations from different iterations [22].

An essential step for timing optimization takes place when the strict timing constraints at the external interfaces are relaxed and handshaking mechanisms are introduced to determine the validity of data at the communication channels. This new paradigm is known as time elasticity [11], which opens a new avenue of optimizations that can either be applied to asynchronous or synchronous circuits. With this new scenario, the optimization parameter is average rather than worst-case performance and automation can play a relevant role in architectural pipelining [20]. Synthesis tools for architectural exploration can be created, thus conquering a field that was only in the domain of manual design.

This article gives a historical view and discusses the evolution of synthesis methods from combinational logic synthesis to automatic pipelining. The classical combinational and sequential logic synthesis techniques are first presented, with strict timing models that do not allow any flexibility at the interfaces. HLS is next discussed and techniques for automatic pipelining are reviewed. General models for time elasticity and handshaking schemes are finally introduced. Asynchronous and synchronous elastic circuits are two related paradigms that exploit elastic timing using different variations of handshakes. The article continues with optimizations that can be introduced when time elasticity is allowed and are the core for automatic pipelining at RTL. The article finally debates about how automatic pipelining can be exploited for architectural exploration. An example shows how a non-pipelined version of the DLX CPU can be sped-up by 3× with a 20% area overhead by using automatic pipelining techniques.

# II. LOGIC SYNTHESIS

### A. Combinational Logic Synthesis

The goal of combinational logic synthesis is to optimize the timing, area, and power of an implementation under the best models at this stage of design. This step is divided into technology-independent synthesis, which optimizes logic expressions to obtain low implementation costs by reducing literal counts or logic depths. and technologydependent synthesis, which maps the logic expressions to a cell library of precharacterized gates, using more accurate area, delay, and power numbers. Since the loads on logic gates are related to placement, technology-dependent synthesis is often interleaved with placement. Most

combinational synthesis algorithms implicitly assume that the circuits are acyclic. Although combinational logic may be implemented with cyclic circuits [44], such structures are not used in mainstream designs.

Early approaches to technology-independent synthesis used exact methods for two-level minimization based on sum-of-products or product-of-sums representations [18]. Since exact logic minimization is an  $\Sigma_2^p$ -complete problem [6], [51], these methods were largely supplanted by heuristic methods [4]. These approaches were then generalized to multilevel logic minimization [3], with enhancements based on algebraic division and algorithmic transformations of logic expressions. Multilevel methods are a vital ingredient in optimizing typical circuits that contain 10-20 levels of logic.

The problem of technology-dependent logic optimization for area/power minimization has no known exact solution in polynomial time, and is typically solved using heuristics. Beginning with a logic decomposition using simple primitives such as two-input NANDs and inverters, the combinational circuit is modeled as a directed acyclic graph. If this is decomposed into a forest of trees where each gate in a tree has at most one fanout, then each such tree can be mapped optimally in linear time in the number of tree vertices [30]. Heuristics have also been developed for providing practical solutions with area/power/delay tradeoffs. Such technology mapping approaches suffer from structural bias, wherein the solution quality depends on the initial decomposition. Methods for overcoming this problem are summarized in [13].

From a timing point of view, the main goal of combinational optimization is to ensure that the longest path through any combinational stage meets the setup time constraint: for edge-triggered circuits, the sum of the longest combinational path delay and the setup time cannot exceed the clock period.

### B. Sequential Logic Synthesis

While combinational timing optimization operates on one combinational block at a time, sequential optimizations address the memory elements within the system. These memory elements may be level-clocked transparent latches or edge-triggered flip-flops, and we will refer to them as "latches" in our discussion. Various sequential logic synthesis operations, such as state minimization and state encoding, can optimize sequential elements in a system. Such operations are primarily directed towards reducing the amount of sequential and combinational logic, and can impact circuit timing by reducing the complexity and delay of a logic implementation.

Time-borrowing between sequential stages is possible through latch-based optimizations such as retiming [34] and clock skew scheduling [19]. Retiming moves memory elements across logic while leaving unchanged the latency on all cycles and input-output paths. By moving combinational boundaries, retiming allows the clock period of a circuit to be improved without altering the externally observable cycle-level timing. Moreover, by moving latches across logic, the number of latches in a system may be reduced.

Fig. 1(a) shows an edge-triggered sequential circuit, where the rectangles correspond to latches and the combinational elements are represented by circles annotated with the element delays. Assuming zero setup times, the clock period, corresponding to the longest combinational path between latches, is 28 units, along the path containing the shaded circles. After retiming some flipflops along the arrows shown in the figure, the resulting circuit is shown in Fig. 1(b), where the period is 17 units. The reader can verify that this is the minimum clock period achievable by the system under retiming. Fig. 1(c) shows a solution with the minimum number of latches at the expense of increasing the cycle period to 18 units.

The retiming problem can be formulated as an integer linear programming problem [34] that can be solved using graph and network flow algorithms. Further efficiencies are introduced in [35] and [46]. Another optimization related to retiming is c-slowing [34], where each memory element is replaced by c consecutive elements. In restricted situations where unrelated data sets are processed, e.g., in acyclic pipelines, this operation can increase the circuit throughput.

# III. HIGH-LEVEL SYNTHESIS AND AUTOMATED PIPELINING

The techniques used in HLS [17] aim at transforming algorithmic descriptions into RTL. Performance and resource constraints are common parameters defined by designers to provide guidelines in the generation of solutions. Optimizing performance under area constraints or optimizing area under performance constraints are typical scenarios in the exploration of the design space.

Many of the techniques used in HLS have been inherited from the body of knowledge of parallelizing compilers [23], including techniques for loop pipelining [24], [31]. The goal is to reduce E in (1) by reducing both N and P. The major impact of loop pipelining is in the reduction of N.

Fig. 2 depicts a classical example of loop pipelining in which the algorithmic description is shown on the left and



Fig. 1. Sequential circuit in (a) original form. The same circuit after retiming with a (b) larger and (c) smaller number of latches.



Fig. 2. Loop pipelining of a FIR filter [24].

a pipelined schedule is shown on the right. The schedule parallelizes operations from different iterations of the same loop (shaded region).

One of the main components of an optimizing compiler is the module for dependency analysis that determines the sequencing constraints of operations. As an example, readafter-write (RAW) dependencies prevent operations that write and read the same operand to be re-ordered. For example, the sequence of instructions

$$\begin{split} \mathtt{R}[\mathtt{i}] &= \mathtt{a} * \mathtt{b}; \\ \mathtt{c} &= \mathtt{R}[\mathtt{j}] + \mathtt{d} \end{split}$$

cannot be reordered unless  $i \neq j$ . The optimizations during HLS are based on static analysis techniques. Data dependencies are analyzed taking into account the worst-case scenarios and produce *static schedules* that are valid for any possible data. For this reason, static schedules cannot fully exploit parallelism when dependency analysis is not able to provide conclusive information.

Automated pipelining without using classical HLS techniques has been previously proposed in [32] and [41]. Both methods assume that the hardware of the datapath is already partitioned into pipelined stages. Then [32] generates a stall controller and the forwarding logic, while [41] informs the designer of the available opportunities for applying forwarding and speculation and based on the designer choice generates a stalling engine. In both cases, the stalling engine is a global nonpipelined controller.

In modern technology nodes, both control and datapath need to be pipelined to avoid long logic and wire delays. The elastic pipelining techniques that will be later presented in this article allow to naturally distribute the pipeline control along the datapath. This is achieved by performing RTL transformations and using distributed handshake protocols. Elastic pipelining allows to keep track of the validity of data and define distributed pipeline boundaries that can easily interact with other components of the system. Additionally, elastic pipelining can deliver

dynamic schedules by analyzing data dependencies "onthe-fly" and stalling the pipeline when data hazards are produced.

### IV. ELASTIC TIMING MODELS

The behavior of a system can be modeled as an ordered set of computations and communications that deliver some output data after reading some input data. The time elapsed between reading inputs and delivering outputs may depend on different factors: complexity of the computation, optimization of the circuit, timing of the environment, etc.

The classical methods for combinational and sequential logic synthesis preserved the cycle-accuracy of the computations, i.e., two systems were said to be equivalent if their behavior could not be distinguished externally at any cycle of the computation. This equivalence allowed to modify the internal sequential behavior of the system (e.g., by retiming flip-flops or reencoding FSMs) but did not allow to modify the external behavior. It also enforced the environment to honor the cycle accuracy when producing the inputs and reading the outputs of the system. We refer to this model as *rigid*.

In a rigid timing model, it is not possible to break long combinational paths by introducing new pipeline stages, since this would modify the timing relationship between the inputs and the outputs of the system. However, modern systems include multiple complex components performing different computations. Very often these components work at different speeds even with clocks running at different frequencies. For the sake of modularity, it is convenient that these modules can be substituted by other instances with the same interface but different power/performance characteristics.

To preserve this type of modularity, circuit designers have proposed systems that tolerate variable latency (aka, time elasticity). As an example, most of the bus protocols have handshaking mechanisms to indicate when access is granted and when data is valid in the bus. Fig. 3 shows two signaling schemes for standard buses, VME [25] and AMBA AHB [2]. The main difference between both is the synchronization protocol: VME is asynchronous whereas AMBA is synchronous. As in most protocols, both have two handshake signals to implement time elasticity. One signal



Fig. 3. Handshaking schemes for buses.

goes from master to slave to indicate the availability of information in the bus. The other signal goes from slave to master to indicate the completion of the transfer.

In asynchronous protocols, data transfers are indicated by the events of the handshake signals and data signals are maintained stable as long as the handshake cycle has not been completed. In the VME bus, the DS signal (data strobe) indicates the period of time in which the master maintains valid data in the bus, while the data transfer acknowledge (DTACK) signal indicates when data has been accepted by the slave.

In synchronous protocols, the initiation and completion of transfers are indicated by the value of the handshake signals at the clock edges. In the AMBA AHB bus, the HTRANS signal (two bits) may indicate an IDLE cycle when no transfer must be performed. The slave can also indicate that it has not been able to accept the transfer (HREADY = 0), thus forcing the master to maintain the same data on the subsequent cycles until the transfer has been completed (HREADY = 1).

Even though the elastic interfaces offer the possibility of modifying the timing at the master or slave modules, the classical logic synthesis techniques do not take advantage of this flexibility for optimization and always preserve the original timing specification at the boundaries of the modules. To exploit elasticity, the timing specifications of the modules must be manually changed at RTL and a new synthesis process must be executed. This constraint prevents automated design explorations playing with time elasticity.

### A. Introducing Elasticity During Synthesis

Design automation requires formal models as a reference to preserve correctness when the implementation of a system is transformed by a set of synthesis rules. The incorporation of time elasticity during synthesis requires timing models that go beyond the cycle-by-cycle equivalence imposed by rigid timing.

One of the simplest and most general models for elasticity is Kahn Process Networks (KPNs) [28], where a system is a set of processes that communicate through unbounded first-in-first-out (FIFO) channels, as shown in Fig. 4. Every channel can be considered as an infinite queue of tokens and the execution model is similar to the firing semantics of Petri nets [40]. A process can start a computation when data is available at all input channels. Every computation reads data from the input channels and writes data to the output channels.

With this execution model, equivalence is defined in terms of data streams in the communication channels, i.e., two systems are equivalent if they produce the same output streams when reading the same input streams. Along this line, several formal models with subtle differences have been defined to decouple timing from data while preserving the equivalence of streams [8], [33]. We will refer to these models as elastic timing models.



Fig. 4. Kahn process network.

Fig. 5 depicts four equivalent streams. They transfer the same data in the same order but at different time instants. The first stream,  $s_1$ , could correspond to a rigid synchronous timing model. Streams  $s_2$  and  $s_3$  could represent two equivalent streams regulated by a synchronous clock in which data transfers occur at different cycles and some of the cycles, depicted as shadowed boxes, contain invalid data (bubbles). Finally,  $s_4$  could represent an asynchronous stream in which no global clock is regulating the data transfers.

At this point, an important observation can be made: When using elastic timing models, the latency of the system can be modified and, thus, long combinational paths can be broken by new sequential elements. This opens the door to a new avenue of exploration techniques that can synthesize circuits with different area/performance/power tradeoffs, in a similar way as different pipelines can be proposed for a microprocessor without modifying the instruction set architecture (ISA).

# B. Introducing Elasticity in Hardware

The KPN model assumes unbounded channels with non-blocking writes. For hardware implementations channels must be bounded and some strategy is required to determine the size of the first-in first-out (FIFO). Two options can be considered as follows:

- calculating an upper bound of the size of the FIFOs considering all possible behaviors of the KPN;
- 2) using blocking writes when FIFOs become full.

The first option is not always acceptable because boundedness imposes timing constraints on the environment. Moreover, the selected bounds also prevent a full flexibility in substituting components by other instances with different timing characteristics.

The second option is more appropriate for elastic timing but requires some handshaking mechanism to block



Fig. 5. Equivalent streams.

the sender when the FIFO at the output channel becomes full. The size of the FIFOs is important in elastic systems since it directly determines the performance of the system and the absence of deadlocks. This aspect will be discussed in Section VII.

To put it simply, the transformation of a rigid system into elastic can be done by substituting the sequential elements (registers) by FIFOs controlled by handshake signals and following the firing semantics of KPNs.

Fig. 6 shows timing diagrams for three signaling schemes corresponding to different timing models. The one in Fig. 6(a) corresponds to a rigid model in which data transfers occur at each cycle, i.e., no handshake signals are required. This scheme would produce streams similar to  $s_1$  in Fig. 5.

The scheme in Fig. 6(b) represents a synchronous elastic channel with a Valid/Ready protocol. The *Valid* signal indicates when the sender has valid data, whereas the *Ready* signal indicates that the receiver is able to accept the data. Data transfers occur when  $Valid \land Ready$ . This scheme is synchronous and the handshake signals are used to generate a gated version of the circuit clock (architectural clock) that triggers the data transfers. This scheme would produce streams similar to  $s_2$  and  $s_3$  in Fig. 5.

Finally, Fig. 6(c) depicts a four-phase asynchronous protocol. In this case, the *Request/Acknowledge* signals play a similar role as the *Valid/Ready* signals in the synchronous protocol. This scheme would produce streams similar to  $s_4$  in Fig. 5. More details about the asynchronous protocols will be given in Section V.

We can observe that the three schemes are very similar and the only difference between rigid and elastic systems is the implementation of the communication channels. This similarity can be observed in Fig. 7, where three different implementations are depicted. The leftmost diagram (a)



Fig. 6. Synchronization mechanisms: (a) synchronous rigid timing, (b) synchronous elastic timing, and (c) asynchronous elastic timing.



Fig. 7. Structure of the design: (a) synchronous rigid timing, (b) synchronous elastic timing, and (c) asynchronous elastic timing (matched delays version).

represents a rigid system with a global clock that triggers the flip-flops between combinational regions. The diagram in the middle is synchronous elastic and has been obtained from the rigid system by substituting the flip-flops by elastic buffers (FIFOs). The controllers implement the handshake protocol and generate the architectural clock as a gated version of the circuit clock. Finally, the rightmost diagram is an asynchronous elastic system. The only difference with the synchronous elastic system is in the controllers that generate the architectural clock using the asynchronous handshake signals. The delays in the req/ack signals are designed to match the delays in the datapath in such a way that the clock edges satisfy the setup/hold timing constraints.

Up to this point, the article has discussed the evolution of timing models to improve system performance. In the sequel, the article will focus on elasticity and how it can be used to optimize a design up to the point of performing automatic pipelining.

#### V. ASYNCHRONOUS CIRCUITS

Asynchronous circuits have been present for many years in the academic community as an alternative to the synchronous [48]. They have also had a limited presence in industry. The goal of asynchronous design is to adapt the delays to the dynamic requirements of the computations and the environment of a circuit.

This section reviews some basic protocols that have been broadly used in asynchronous design with the goal of illustrating possible implementations for elastic timing models, such as the one shown in Fig. 6(c). We refer the reader to [48] for a more extensive review of the area.

One of the pioneering efforts in asynchronous systems was the Macromodules project [37]with the introduction of the *Delay-Insensitive* (DI) modules that can interact with the environment under the presence of arbitrary delays at the inputs and outputs of the module. Seitz introduced *self-timed systems* [45] with the assumption that the components of the system were working with local timing constraints (e.g., negligible wire delays).

In general, different forms of asynchronous circuits have been proposed, all of them aiming at avoiding a global synchronization at system level. As discussed in Section IV, the implementation of these type of timing schemes requires a set of handshake signals with a synchronization protocol.

Fig. 8 illustrates the two most popular protocols using a pair of *request/acknowledge* signals. In the 4-phase protocol, a communication cycle involves four events and the handshake signals return to zero at the end of each data transfer. In the two-phase protocol, each cycle involves only two events (either rising or falling). Both protocols have been used extensively in asynchronous circuits, showing different advantages and disadvantages.

As shown in Fig. 7, the difference between synchronous and asynchronous timing models is mostly reduced to the implementation of the synchronization layer that generates the clock signal for the sequential elements of the circuit.

Fig. 9 depicts one of the best known schemes for asynchronous pipelines, based on a four-phase protocol implemented with Muller's C-elements [38]. The picture shows a linear pipeline with latches between regions of combinational logic (CL). The C-element at each latch guarantees that a latch becomes transparent when a new data is available at the input (Req = 1) and no data is stored in the following latch (Ack = 0). A FIFO (in the KPN sense) can be implemented as a sequence of latches with no logic in between. A similar scheme using a



Fig. 8. Asynchronous four-phase and two-phase communication protocols.



Fig. 9. Muller's pipeline.

two-phase protocol was proposed by Sutherland to implement the well-known *Micropipelines* [50].

The handshake protocol also requires certain timing constraints to guarantee that latches are neither missing nor overwriting data. For this reason it is necessary to guarantee a minimum separation of events that allows data to traverse combinational logic without having any setup/hold timing violation. Fig. 9 shows a delay for every *Req* signal to properly synchronize the clock signals at each latch. There are different ways to implement this delay, either by a chain of logic gates that mimic the delay in the combinational logic (bundled-data approach) or by implementing the combinational logic with multiple rails (e.g., dual rail) and using explicit circuits to detect the completion of the computation.

The latter case can be pushed to the limit when no timing assumptions are required to guarantee the correct sequencing of events. This can be achieved by using delay-insensitive codes [52]. However, the robustness provided by these schemes comes with a high cost in area and power [49].

### VI. ELASTIC CIRCUITS

The worlds of synchronous and asynchronous designs converged from different starting approaches. On the one hand, [42] studied synchronous implementations of the asynchronous circuits. On the other hand, [10] proposed latency-insensitive systems. Both forms in essence implemented the idea of an asynchronous system quantized by the global clock reference as illustrated by the timing diagram in Fig. 6(b) and the structural diagram in Fig. 7(b).

Since global *stall* signals are typically unacceptable for large systems, a distributed control is required to handle the *back-pressure* generated when a receiving block is *not ready* to accept new data. Extra storage is necessary inside the elastic buffer<sup>1</sup> (EB) to accommodate new incoming data from the previous block to the stalled block, while preserving the previously received data that have not yet been processed. When the *ready* signal is asserted, the EB

<sup>1</sup>Also called *relay station* in [10].

can deliver the stored data in FIFO order. A desirable property is that EBs do not introduce additional latency with regard to conventional synchronous registers in absence of back-pressure.

### A. Design of Elastic Buffers and Elastic Pipelines

Fig. 10 shows three implementation schemes for an EB that can hold two data items. Note that while this figure presents EBs capable of holding two data items, in general EBs are elastic FIFOs of arbitrary finite capacity.

Fig. 10(a) and (b) shows two possible implementations of an EB using edge-triggered registers. Other implementations are also possible, but in all of them a multiplexer is required to select data from one of the registers. Fig. 10(c) shows the structure of a more efficient implementation using transparent latches [16], [26]. The use of flip-flop based registers though can sometimes be more convenient for timing analysis or for the implementation in FPGAs that do not provide support for latches.

The designs depicted in Fig. 7 can be easily extended to netlists in which the elastic modules have multiple inputs and outputs connected to other elastic modules.

An example of such extension is depicted in Fig. 11. The controller in the middle of the figure is synchronized with the neighboring controllers. The *Join* block combines the handshake signals of the input modules with the ones of the controller. Similarly, the *Fork* blocks combines the handshake signals for the output modules. Intuitively, the *Join* block implements the conjunction of *valid* signals, whereas the *Fork* block implements the disjunction of *stop* (not ready) signals. Note that the same scheme applies for an asynchronous circuit with *req/ack* handshake signals.

### **B. Performance Analysis**

The performance analysis of an elastic system is founded on the well-known theory of marked graphs [14], [43], which can be briefly summarized as follows.

Every directed cycle in the system has a number of tokens and bubbles stored in EBs and distributed in a number of stages. A data transfer in a cycle implies swapping the location of a token and a bubble, i.e., the token moves to an empty slot in an EB, whereas the previous location of the token becomes empty. The number of tokens and bubbles is an invariant for every cycle.



Fig. 10. Implementation of elastic buffers.



Fig. 11. Synchronous elastic module with multiple inputs and outputs.

Every cycle *C* in the system has an associated throughput (tokens per cycle) that can be calculated as follows:

$$Th(C) = \frac{\min(T_C, B_C)}{S_C}$$

where  $T_C$  and  $B_C$  are the number of tokens and bubbles, respectively, and  $S_C$  is the number of stages in the cycle. Intuitively, the formula computes the average number of tokens per cycle that can be processed at every stage. However, given the duality between the move of tokens and bubbles, the performance may also be limited by the lack of space in the cycle (a token cannot move if there is no available space at the next stage). This is the reason for the *min* operator in the formula. In a strongly connected system with several cycles, the performance is determined by the most stringent cycle:

$$Th = \min_{C} Th(C).$$

Fig. 12 depicts an example for performance analysis in which the shadowed ovals represent combinational logic and the boxes represent two-slot EBs. The system has two simple cycles,  $C_1$  and  $C_2$ , with the following characteristics:

$$T_1 = B_1 = S_1 = 4;$$
  $T_2 = 4, B_2 = 6, S_2 = 5.$ 



Fig. 12. Performance analysis of an elastic system.

Therefore,  $Th = \min(Th_1, Th_2) = \min(1, 4/5) = 4/5$ , indicating that every computational unit will operate 4 out of 5 cycles, on average. It is interesting to note that the addition of one token in  $C_1$  would have a negative impact on performance since  $T_1 = 5$  and  $B_1 = 3$ , hence Th = 3/4.

Note that a cycle with zero tokens or zero bubbles has zero throughput (deadlock). Therefore, a necessary and sufficient condition for liveness is that every cycle has at least one token and one bubble.

Besides the global constraints, every EB requires a local performance constraint to sustain the maximum allowed throughput under the presence of back-pressure. In order to accommodate incoming tokens and locally propagate the handshake signals (valid/ready), every EB requires at least two slots. All the previous analysis can be extended to multi-cycle units or asynchronous systems with arbitrary forward/backward propagation delays [11].

# VII. OPTIMIZATION OF ELASTIC CIRCUITS

Section II-B describes some of the optimization transformations that can be applied for standard synchronous circuits. All of them are latency-preserving transformations in that they do not change the latencies of a computation as measured in clock cycles. All of these transformations can be applied to elastic designs.

Asynchronous and synchronous elastic designs, on the other hand, tolerate changes in latency. Such tolerance can be used to introduce novel correct-by-construction transformations enabling the exploration of new microarchitectural tradeoffs [21], [29]. In some cases, the cycle time of the system can be reduced by increasing the latency of some operations, e.g., by introducing more pipeline stages. By properly balancing cycle time and throughput, the system with the optimal effective cycle time can be achieved. The effective cycle time is a performance metric similar to the time-per-instruction (TPI) in CPU design. It captures how much time is required to process one token of information—the smaller the better. For example, let us assume that the design in Fig. 12 has a cycle period of 500 ps. Given that Th = 4/5, the effective cycle time is 625 ps (500/Th), indicating that a token is processed every 625 ps, on average.

One of the properties of elastic circuits is that they accept the insertion of empty buffers at arbitrary locations while preserving the behavior of the design. This can be done both for asynchronous circuits [36], [39] or synchronous elastic circuits [8], [33]. We often refer to these empty buffers as *bubbles*. The process of inserting bubbles into an asynchronous design was called *slack matching* [36], while for synchronous elastic—*recycling* [9].

The main reason for the insertion of *bubbles* is performance improvement. Fig. 13(a) shows an example after an optimal retiming. The combinational nodes (shown as circles) are labeled with their delays. The boxes represent 2-slot elastic buffers. Those labeled with a dot



Fig. 13. Design after (a) retiming and (b) retiming and recycling.

contain one token (valid data) and one bubble, whereas the unlabeled ones contain two bubbles. The cycle time of this design is 17 time units.

Fig. 13(b) shows an optimal configuration combining retiming and recycling. An empty elastic buffer is inserted into the bottom path. The cycle time has been reduced to 11 time units. The throughput is determined by the slowest cycle. The token/buffer ratios for each cycle are 1, 4/5, and 2/3. Therefore, the throughput is 2/3, and the average number of cycles to process a token is 3/2. This provides an effective cycle time of 16.5 time units (16.5 =  $11 \cdot 3/2$ ). It means that a new token is processed on average every 16.5 time units, which is an improvement compared to the 17 units of the optimally retimed design.

As explained in [11] another technique to optimize the elastic circuits is buffer sizing: increasing the capacity of EBs. Unlike recycling, buffer sizing can never degrade the cycle time of the design. Both techniques can be combined with retiming in an automated flow to optimize elastic designs. The next section will describe additional optimization techniques based on anti-tokens.

## VIII. ANTI-TOKENS

One of the important optimizations towards automatic pipelining was the introduction of early evaluation and antitokens. Early evaluation contributes to increase the performance of elastic systems, because the design does not have to stall waiting for irrelevant data. Early evaluation was proposed for asynchronous [1], [5] and synchronous designs [15].

# A. Early Evaluation

The execution model of conventional elastic systems is based on strict evaluation: a computation is initiated only when all input data are available. This requirement can be relaxed if early evaluation is used. For example, the behavior of a 2-input multiplexer can be modeled with the following statement:

z = if s then a else b.

In case s is available and its value is true, there is no need to wait for *b*; the result can be delivered as soon as *a* arrives. Similarly when s is false.



Fig. 14. Moving tokens and anti-tokens for early evaluation.

When using early evaluation, the spurious enabling of functional units must be prevented if the late inputs arrive after the completion of the computation. One of the mechanisms to discard late inputs is the use of negative tokens, also called anti-tokens. Each time an early evaluation occurs, an anti-token is generated at every non-required input in such a way that it annihilates when it meets a positive token [15].

Fig. 14(a) depicts a circuit with early evaluation. The circuit contains valid data in all registers with tokens. The "0"-input of the multiplexer has no valid data, however the control signal indicates that the "1"-input must be selected. In this situation, there is no need to wait for the "late" data produced by the ALU.

Fig. 14(b) shows how the "1"-input has been sent to the output of the multiplexer. Additionally, an anti-token (-1)has been sent backward to the "0"-input. In Fig. 14(c), the result from the ALU arrives, and then in Fig. 14(d) the valid data and the anti-token cancel each other, thus disregarding the nonselected data.

Various implementations exist both in synchronous (e.g., [12]) and asynchronous (e.g., [1], [47]) circuits. Among the different implementations of anti-tokens, two main classes have been distinguished: passive anti-tokens, when they statically wait for the arrival of tokens, and active anti-tokens, when they move backward to meet tokens.

Active anti-tokens have the advantage of disabling data transfers proactively, potentially providing power savings (fewer computations performed). Passive anti-tokens are simpler to implement and have a lower area overhead.

### **B.** Anti-Token Transformations

Anti-tokens can also be used to enable new retiming configurations beyond the ones presented in Section II-B. An empty EB is equivalent to an EB with one token of information followed by an anti-token injector with one anti-token, as shown in Fig. 15(a). An anti-token injector can be implemented with a simple control that contains an



Fig. 15. Anti-token transformations: (a) insertion, (b) grouping, (c) retiming, and (d) memory bypass.

up-down counter placed on the communication channel (without any latency penalty). When a token flows through a nonempty anti-token injector, the token and the antitoken cancel each other, updating the counter.

Anti-tokens can be retimed [as in Fig. 15(c)] and grouped [as in Fig. 15(b)]. However, anti-token counters cannot be retimed through computational units that have state or memory. Such state can be represented in the microarchitectural graph as a self-loop of the node, which will effectively disable retiming and propagation of anti-tokens.

### C. Re-Designing for Average Performance

The performance of a system with early evaluation is no longer determined by the slowest cycle, since average-case performance is achieved instead of worst-case. Early evaluation allows to design for the typical case in terms of data variability. Data that is not selected often can be safely delayed by a few clock cycles without a significant impact on the throughput. By using this technique, it is possible to achieve a cycle time reduction with small throughput degradation, thus improving the effective cycle time. Relaxing the timing constraints on parts of the design that are not used very often can also result in power and area savings.

Fig. 16(b) shows a design similar to Fig. 13(b), with an early evaluation multiplexer selecting between three branches. In this example, anti-token insertion has been applied to the dashed channel. Then, the new EB can be retimed backwards across the node with delay 8, and the bubble between nodes 8 and 9 can be removed. This new configuration has a cycle time of 11 units, but the estimated throughput using an ILP model [7] is higher, 0.918, since there is only one cycle with a bubble<sup>2</sup> compared to Fig. 16(a), where two of the three cycles have bubbles. The resulting effective cycle time is 11.98 units. This configuration can only be achieved by using the antitoken insertion transformation.

There is no known efficient exact method to compute the throughput of a system with early evaluation. An upper bound method using linear programming is presented in [27]. Each input must be assigned a probability so that the performance can be analyzed. Such probabilities should be obtained by running a typical application on the system, and then counting how often each input is selected.

# D. Memory Bypass

One of the key strategies for boosting performance in computer architecture is to parallelize instruction execution when data and control dependencies do not prevent it. A common technique for that is to bypass data that have not been yet committed to memory or to the register file.



Fig. 16. Optimal configuration with (a) early evaluation and (b) antitoken insertion. Boxes represents two-slot elastic buffers.

These schemes introduce multiplexers to forward data to the requesting units.

Fig. 15(d) depicts a transformation to introduce a memory (or register file) bypass [29]. The memory block is represented by the write logic (W), the read logic (R) and a buffer containing the information. A memory can be modeled as a large register that is read/written at every cycle, even though the read/write operations only affect a small subset of bits.

The transformation introduces an EB that delays the commitment of the write data (wd). Additionally, a register to delay the write address (wa) is required. The output multiplexer can select between the data coming from memory or the one coming from the noncommitted data. The selection is done depending on the coincidence of the read address (ra) and the previous write address.

The relevance of this transformation is twofold. On one hand, a new EB is introduced that can be retimed backwards searching for more pipelining opportunities that can reduce the cycle time. On the other hand, the output multiplexer can perform early evaluation and forward data one cycle earlier when no data dependencies exist between the read and write requests. Interestingly, this transformation can be applied iteratively, creating a several forwarding levels.

Memory bypass is one of the key enablers of architectural exploration. By applying a different number of bypass transformations in the architecture, various configurations trading-off area and performance can be generated and compared.

#### IX. AUTOMATIC PIPELINING

By using the techniques presented in previous sections, it is possible to automatically pipeline a design starting from a functional specification graph. Pipelining is achieved by by applying a sequence of correct-by-construction structural transformations that generate an elastic distributed control in which every block only controls the input/ output EBs and local computation blocks. The generated pipelines can handle dynamic data dependencies and do not rely on a static global schedule.

Bypassing memory elements using early-evaluation is essential for automatic pipelining, since they introduce

<sup>&</sup>lt;sup>2</sup>The cycle corresponds to the one with the anti-token (-1), as the sum of a token and an anti-token is equivalent to a bubble.

new EBs that can be retimed backwards. Finally, the system can be pipelined by retiming the EBs inserted with the bypasses and using other transformations such as recycling or anti-token insertion.

# A. Example: Pipelining a Reduced Instruction Set

Fig. 17(a) shows the specification of a simple design. The register file *RF* is the only state holding block. *IFD* fetches instructions and decodes the opcode and register addresses. *ADD* and *M* are arithmetic functions. The results are selected by the multiplexer for *RF* write-back. *M* has been divided into three stages. The potential boundaries for breaking up logic to allow pipelining is a design decision that is typically determined a priori, regardless whether pipelining will be applied or not.

In Fig. 17(b), three memory bypasses have been added to *RF* for building a bypass network. The node *DD* receives all previous write addresses and the current read address in order to detect any dependencies and determine which of the inputs of the bypass multiplexer must be selected.

The right-most multiplexer and the bypass EBs must be duplicated to feed each bypass path independently, enabling new forwarding paths, as shown in Fig. 17(c). Once the forwarding paths have been created, the design can be pipelined by applying retiming and anti-token insertion, as shown in Fig. 17(d). The final elastic pipeline is optimal in the sense that its distributed elastic controller inserts the minimum number of stalls. Furthermore the pipeline structure is not redundant since there are no duplicated nodes. Therefore, this is as good as a manually designed pipeline.

Fast instructions that require few cycles to compute, like *ADD* in this example, use the bypass network to

forward data avoiding extra stalls, while long instructions use the bypass network as a stall structure that handles data hazards. In this example, the only possible stalls occur when the paths with anti-token counters are selected by the early evaluation multiplexers. This situation corresponds to a RAW dependency involving a result computed by M, which needs three cycles to complete.

# **B.** Microarchitectural Exploration

As the graph specification grows, the number of possible pipelining configurations explodes exponentially, and manual exploration becomes complicated and errorprone. The retiming and recycling optimizations, including anti-token insertion and retiming, can be unified as a mixed integer linear programming problem (MILP) [7], and solved using linear programming tools. Therefore, automatic exploration of pipelines can be achieved by trying different combinations of bypasses added to each of the memory elements of the design, and then automatically pipelining each exploration point [20]. Since the retiming and recycling method can only compute an upper bound of the throughput instead of the exact throughput, the most promising designs should be simulated at the end of the exploration to identify the best one, or to study a possible trade-off between performance and area or power.

Let us illustrate this exploration method on a simple microarchitecture similar to a DLX, shown in Fig. 18 before pipelining. In this microarchitecture, there is no pipelining and every instruction is executed in one cycle with a long cycle time. The execution unit has an integer ALU and a long operation F. The instruction decoder ID produces the opcode, oc, that goes to the write-back multiplexer and a jump address, ja, which depends on the



Fig. 17. (a) Graph model of a simple design, (b) after 3 bypasses, (c) duplicate mux, enable forwarding, and (d) final pipeline after transformations.



Fig. 18. DLX initial graph.

previous ALU operation, stored in register BR. Table 1 shows approximate normalized area and delays of the functional blocks. These parameters have been obtained by synthesizing some of the blocks in a 65-nm technology library (ALU, RF, mux2, EB, and +4). The rest of the values have been estimated.

F is considered to be a floating point unit which may be divided into several blocks for pipelining  $(F_1, \ldots F_i)$ . The memory has a read latency of  $L_{\text{MEM}}$  cycles, which is set to 10 in Table 1. This is a typical value for the read latency of an on-chip cache memory.

Fig. 19 shows the effective cycle time and area of the best Pareto points found for different partitions of F. Using a set of realistic assumptions for the probability of each instruction and the probability of data hazards, a pipeline was generated automatically for each possible partition of F, collecting the effective cycle time of the resulting pipeline and its area. It also shows how many bypasses were added to the register file and the memory to introduce instruction parallelism. This figure illustrates how more pipelining implies better performance at the expense of more area. While the best performance is achieved with 6 stages, shorter pipelines (such as the ones circled in the figure) are simpler and can offer attractive solutions.

Fig. 20 shows one of the best design points, with F partitioned into three blocks. Three bypasses have been applied to RF and then EBs have been retimed to pipeline F. Note that an extra bubble has been inserted at the output of  $F_3$ : the reduction in the throughput due to this bubble is compensated by a larger improvement in cycle time. Without this bubble the critical path would include the delay of the multiplexers after  $F_3$ . This kind of decisions are driven by the probabilities at the multiplexers, and derived automatically by the MILP model.

Table 1 Delay, Area and Latency Numbers for DLX Example

| Block | Delay          | Area | Lat. | Block | Delay | Area              | Lat. |
|-------|----------------|------|------|-------|-------|-------------------|------|
| mux2  | 1.5            | 1.5  | 1    | EB    | 3.15  | 4.5               | 1    |
| ID    | 6.0            | 72   | 1    | +4    | 3.75  | 24                | 1    |
| ALU   | 13.0           | 1600 | 1    | F     | 80.0  | 8000              | 1    |
| RF W  | 6              | 6000 | 1    | RF R  | 11    | _                 | 1    |
| MEM W | ( <del>-</del> | -    | 1    | MEM R | -     | (S <del>=</del> 3 | 10   |



Fig. 19. Effective cycle time and area of the best pipelined design for different depths of F. (x,y) and (x,y,z) tuples represent the depth of F, the number of bypasses applied to RF and to MEM (z = 9 if omitted).

The bypasses in the memory MEM implicitly create a load-store buffer to hide memory latency, as shown in Fig. 20. Such structure can be substituted by a more efficient implementation: an associative memory. The exploration algorithm can detect the need for such loadstore buffers and calculate their optimal size.

Fig. 21 illustrates how a simple program executes in this example. Fig. 21(a) shows the trace of a set of 5 instructions executed in the original DLX design from Fig. 18. In this nonpipelined design, the cycle time is 107.65 units (using the data in Table 1). Each instruction takes one cycle to complete and the program runs in 538.25 time units.

Fig. 21(b) shows the trace corresponding to the execution of the same program in the pipelined version of the DLX shown in Fig. 20. In this case, since there are three stages in F, the cycle time is reduced down to 30 units. However, it needs 11 cycles to complete, so the total execution time is 330 units of time. The ALU stage in the trace corresponds to the execution of the ALU function in Fig. 20, while ALU2 is the empty stage starting with the EB right next to the ALU unit.

There are two data dependencies during this execution run: I4 reads the result from I2, and I5 from I4. The data dependencies are drawn with an arrow in the trace. For the first dependency, the execution of I4 and I5 is stalled during two cycles as the data hazard cannot be resolved until F1-F2-F3 is completed. Therefore, two bubbles are introduced in the pipeline (shaded boxes in Fig. 21). For the second dependency, there is no need to stall, since the ALU unit completes within one cycle, and there is one forwarding path that can be used to bypass the result to the next instruction. This forwarding path is represented by a dashed line in Fig. 20.



Fig. 20. Pipelined DLX (F split into 3 blocks. RF has 3 bypasses and M 9).



Fig. 21. Execution of five instructions for the unpipelined and pipelined designs of the DLX example in this section, (a) in the unpipelined DLX (b) in the pipelined DLX. The 5 instructions are I1 = ALU, I2 = F, I3 = F, I4 = ALU, I5 = ALU. I4 needs to read the result from I2, and I5 needs to read the result from I4.

As explained above, this example shows how the bubble added at the end of the F pipeline is useful to further reduce the cycle time. If instructions using F occur with low probability, this bubble will be automatically inserted by the exploration algorithm, thus reducing cycle time with a small throughput penalty. If F is executed with high probability, the bubble will not be inserted because the gain in cycle time will not compensate the throughput penalty.

# X. CONCLUSION

After the classical optimizations in logic synthesis, automatic pipelining is the next step in design automation that contributes to improve the power/performance tradeoffs in system design.

The article has presented the historical evolution of design automation that has enabled novel optimizations for pipelining. Although the proposed transformations can improve performance by introducing handshake-based elastic timing, the behavior still preserves the so-called in-order execution, i.e., the generated traces transfer information in the same order as in the original non-elastic system.

Several problems still remain opened. Equivalence checking is one of the main challenges for transformations that modify timing beyond the natural boundaries of RTL state signals. Even today, retiming has a very limited use due to the low scalability of the sequential equivalence algorithms. Some initial efforts have already been published [53], but there is still a long way to go until commercial verification tools can adopt this technology.

Automation for out-of-order execution is another challenge that deserves further investigation. While this paradigm is highly exploited in advanced CPUs, there is still little progress in automatic synthesis. ■

### REFERENCES

- [1] M. Ampalam and M. Singh, "Counterflow pipelining: Architectural support for preemption in asynchronous systems using anti-tokens," in Proc. Int. Conf. Computer-Aided Design (ICCAD), 2006, pp. 611-618.
- [2] ARM Limited, AMBA Specification (Rev 2.0), 1999
- [3] R. Brayton, G. Hachtel, and  $A. \ Sangiovanni-Vincentelli, ``Multilevel"$ logic synthesis," Proc. IEEE, vol. 78, no. 2, pp. 264-300, Feb. 1990.
- [4] R. K. Brayton, A. L. Sangiovanni-Vincentelli, C. T. McMullen, and G. D. Hachtel, Logic Minimization Algorithms for VLSI Synthesis. Norwell, MA, USA: Kluwer Academic, 1984.
- [5] C. Brej and J. Garside, "Early output logic using anti-tokens," in Proc. Int. Workshop on Logic Synthesis, May 2003, pp. 302–309.

  [6] D. Buchfuhrer and C. Umans, "The
- complexity of Boolean formula minimization," J. Comput. Sci. Syst. Sci.,
- vol. 77, no. 1, pp. 142–153, Jan. 2011. D. Bufistov *et al.*, "Retiming and recycling for elastic systems with early evaluation," in Proc.
- ACM/IEEE Design Automation Conf., July 2009, pp. 288-291.
- L. Carloni, K. McMillan, and A. Sangiovanni-Vincentelli, "Theory of latency-insensitive design," IEEE Trans. Computer-Aided Design, vol. 20, no. 9, pp. 1059-1076, Sept. 2001.
- L. Carloni and A. Sangiovanni-Vincentelli, "Combining retiming and recycling to optimize the performance of synchronous circuits," in Proc. 16th Symp. Integr. Circuits Syst. Design (SBCCI), Sep. 2003, pp. 47-52.

- [10] L. P. Carloni, K. L. McMillan, A. Saldanha, and A. L. Sagiovanni-Vincentelli, "A methodology for correct-by-construction latency insensitive design," in *Proc. Int. Conf.* Computer-Aided Design (ICCAD), Nov. 1999, р. 309–315.
- [11] J. Carmona, J. Cortadella, M. Kishinevsky, and A. Taubin, "Elastic circuits," IEEE Trans. Computer-Aided Design Integr. Circuits Syst.,
- vol. 28, no. 10, pp. 1437–1455, Oct. 2009.
  [12] M. Casu and L. Macchiarulo, "Adaptive latency-insensitive protocols," *IEEE Design* Test Comput., vol. 24, no. 5, pp. 442-452, 2007.
- [13] S. Chatterjee, "On Algorithms for Technology Mapping," Ph.D. thesis, University of California at Berkeley, Berkeley, CA, USA,
- [14] F. Commoner, A. W. Holt, S. Even, and A. Pnueli, "Marked directed graphs, . Comput. Syst. Sci., vol. 5, pp. 511-523, 1971.
- [15] J. Cortadella and M. Kishinevsky, 'Synchronous elastic circuits with early evaluation and token counterflow," in Proc. ACM/IEEE Design Automat. Conf. (DAC), Jun. 2007, pp. 416–419. [16] J. Cortadella, M. Kishinevsky, and
- B. Grundmann, "Synthesis of synchronous elastic architectures," in  $Proc.\ ACM/IEEE$ Design Automation Conference (DAC), Jul. 2006, pp. 657-662.
- [17] P. Coussy and A. Morawiec, Eds., High-Level Synthesis: From Algorithm to Digital Circuit. New York, NY, USA: Springer-Verlag, 2008.
- [18] E. J. McCluskey, Jr., "Minimization of Boolean functions," *Bell System Technical* Journal, vol. 35, no. 6, pp. 1417-1444, Nov. 1956.
- [19] J. P. Fishburn, "Clock skew optimization," IEEE Trans. Computers, vol. 39, no. 7, pp. 945–951, Jul. 1990.
- [20] M. Galceran-Oms, J. Cortadella, M. Kishinevsky, and D. Bufistov, "Automatic microarchitectural pipelining," Design, Automat. Test Europe, pp. 961-964, Apr. 2010.
- [21] M. Galceran-Oms, A. Gotmanov, J. Cortadella, and M. Kishinevsky, "Microarchitectural transformations using elasticity," ACM J Emerg. Technol. Comput. Syst. (JETC), vol. 7, no. 4, p. 18, 2011.
- [22] S. Gupta, N. Dutt, R. Gupta, and A. Nicolau, "Spark: A high-level synthesis framework for applying parallelizing compiler transformations," in Proc. Int. Conf. VLSI Design, Jan. 2003, pp. 461–466.
  [23] S. Gupta, R. Gupta, N. Dutt, and A. Nicolau,
- 'Coordinated parallelizing compiler optimizations and high-level synthesis,"

- ACM Trans. Design Automat. Electron. Syst., vol. 9, no. 4, pp. 441–470, Oct. 2004. C.-T. Hwang, Y.-C. Hsu, and Y.-L. Lin, "PLS:
- A scheduler for pipeline synthesis," IEEE Trans. Computer-Aided Design, vol. 12, no. 9, pp. 1279–1286, Sept. 1993.
  [25] IEEE Standard for a Versatile Backplane Bus:
- VMEbus, IEEE Std. 1014-1987 (R2008), 1987.
- [26] H. M. Jacobson et al., "Synchronous interlocked pipelines," in Proc. Int. Symp. Adv. Res. Asynchr. Circuits Syst., Apr. 2002, pp. 3-12.
- J. Júlvez, J. Cortadella, and M. Kishinevsky, Performance analysis of concurrent systems with early evaluation," in Proc. Int. Conf. Computer-Aided Design (ICCAD), Nov. 2006, pp. 448-455.
- [28] G. Kahn, "The semantics of a simple language for parallel programming," in *Information*Processing, J. L. Rosenfeld, Ed. Stockholm, Sweden: North Holland, Amsterdam, Aug. 1974, pp. 471-475.
- [29] T. Kam, M. Kishinevsky, J. Cortadella, and M. Galceran-Oms, "Correct-by-construction microarchitectural pipelining," in Proc. Int. Conf. Computer-Aided Design (ICCAD), 2008, рр. 434–441.
- [30] K. Keutzer, "DAGON: Technology binding and local optimization by DAG matching," in Proc. Design Automat. Conf., Jun. 1987, pp. 341-347.
- A. Kondratyev, L. Lavagno, M. Meyer, and Y. Watanabe, "Realistic performanceconstrained pipelining in high-level synthesis," in *Proc. Design, Automat. Test* Europe (DATE), Mar. 2011, pp. 1-6.
- [32] D. Kroening and W. Paul, "Automated pipeline design," in Proc. ACM/IEEE Design
- Automat. Conf., Jun. 2001, pp. 810–815.
  [33] S. Krstic, J. Cortadella, M. Kishinevsky, and J. O'Leary, "Synchronous elastic networks," in *Proc. FMCAD*, 2006, pp. 19–30.
- [34] C. Leiserson and J. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 6, no. 1, pp. 5–35, 1991.
- [35] N. Maheshwari and S. S. Sapatnekar, "An improved algorithm for minimum-area retiming," in Proc. ACM/IEEE Design Autom. Conf., 1997, pp. 2-7.
- [36] R. Manohar and A. J. Martin, "Slack elasticity in concurrent computing," in Proc. 4th Int. Conf. Math. Program Construct., vol. 1422, Lecture Notes in Computer Science, J. Jeuring, Ed., 1998, pp. 272-285.
- C. E. Molnar, T.-P. Fang, and F. U. Rosenberger, "Synthesis of delay-insensitive modules," in Proc. Chapel Hill Conf. Very Large Scale Integr., H. Fuchs, Ed., 1985, pp. 67-86.
- [38] D. E. Muller, "Asynchronous logics and application to information processing," in

- Proc. Symp. Applicat. Switch. Theory Space Technol., 1962, pp. 289-297.
- [39] D. E. Muller and W. S. Bartky, "A theory of asynchronous circuits," in Proc. Int. Symp. Theory Switch., Apr. 1959, pp. 204-243.
- [40] T. Murata, "Petri Nets: Properties, analysis and applications," *Proc. IEEE*, vol. 77, no. 4, pp. 541–580, Apr. 1989.
- [41] E. Nurvitadhi, J. Hoe, T. Kam, and S.-L. L. Lu, "Automatic pipelining from transactional datapath specifications," IEEE Trans. Computer-Aided Design, vol. 30, no. 3, pp. 441-454, Mar. 2011.
- [42] J. O'Leary and G. Brown, "Synchronous emulation of asynchronous circuits," IEEE Trans. Computer-Aided Design, vol. 16, no. 2, pp. 205-209, Feb. 1997.
- [43] C. V. Ramamoorthy and G. S. Ho, "Performance evaluation of asynchronous concurrent systems using Petri nets,' IEEE Trans. Software Eng., vol. 6, no. 5, pp. 440–449, 1980. [44] M. Riedel and J. Bruck, "Cyclic boolean
- circuits," Discr. Appl. Math., vol. 160, no. 3, pp. 1877–1900, 2012.
  [45] C. L. Seitz, "System timing," in *Introduction to*
- VLSI Systems, C. A. Mead and L. A. Conway, Eds. Boston, MA, USA: Addison-Wesley, 1980.
- [46] N. Shenoy and R. Rudell, "Efficient implementation of retiming," in Proc. IEEE/ ACM Int. Conf. Computer-Aided Design, 1994, pp. 226-233.
- [47] D. Sokolov, I. Poliakov, and A. Yakovlev, "Analysis of static data flow structures, Fundamenta Informaticae, vol. 88, no. 4, pp. 581–610, 2008.
- [48] J. Sparsø and S. Furber, Eds., Principles of Asynchronous Circuit Design: A Systems Perspective. Norwell, MA, USA: Kluwer Academic, 2001.
- [49] J. Sparsø and J. Staunstrup, "Delay-insensitive multi-ring structures," Integr., VLSI J., vol. 15, no. 3, pp. 313–340, Oct. 1993.
- [50] I. E. Sutherland, "Micropipelines," Commun. ACM, vol. 32, no. 6, pp. 720–738, Jun. 1989. [51] C. Umans, T. Villa, and
- A. Sangiovanni-Vincentelli, "Complexity of two-level logic minimization," IEEE Trans. Computer-Aided Design, vol. 25, no. 7, pp. 1230-1246, Jul. 2006.
- [52] T. Verhoeff, "Delay-insensitive codes—An overview," Distrib. Comput., vol. 3, no. 1, pp. 1-8, 1988.
- V. Wijayasekara and S. Srinivasan, "Equivalence checking for synchronous elastic circuits," in Proc. IEÉE/ACM Int. Conf. Formal Meth. Models Codesign (MEMOCODE), Oct. 2013, pp. 109-118.

### ABOUT THE AUTHORS

Jordi Cortadella (Fellow, IEEE) received the M.S. and Ph.D. degrees in computer science from the Universitat Politècnica de Catalunya, Barcelona, Spain, in 1985 and 1987, respectively.

He is a Professor in the Department of Computer Science of the same university and member of the Academia Europaea. His research interests include formal methods and computer-aided design of VLSI systems with special emphasis on asynchronous circuits, concurrent systems, and logic synthesis.



Dr. Cortadella has served on the technical committees of several international conferences in the field of Design Automation and Concurrent Systems. He received best paper awards at the International Symposium on Advanced Research in Asynchronous Circuits and Systems (2004), the Design Automation Conference (2004) and the International Conference on Application of Concurrency to System Design (2009). In 2003, he was the recipient of a Distinction for the Promotion of the University Research by the Generalitat de Catalunya.

Marc Galceran-Oms received the M.S. and Ph.D. degrees in computer science from the Universitat Politecnica de Catalunya, Barcelona, Spain, in 2007 and 2011, respectively.

His research interests include formal methods in system design, logic synthesis and data mining. He is currently in a back-end design group in eSilicon. Prior to that, he was at an asynchronous design start-up company (Elastix) and as an intern at the Strategic CAD Labs group of Intel.



Mike Kishinevsky (Senior Member, IEEE) received the M.S. and Ph.D. degrees in computer science from the Electrotechnical University of St. Petersburg, St. Petersburg, Russia.

He leads a research group in front-end design at Strategic CAD Labs of Intel. Prior to joining Intel in 1998, he has been a Research Fellow at the Russian Academy of Science, a Senior Researcher at a start-up in asynchronous design (TRASSA), a Visiting Associate Professor at the Technical



University of Denmark, and a Professor at the University of Aizu, Fukushima, Japan. He has served on the technical program committee at several conferences and workshops.

Dr. Kishinevsky received the Semiconductor Research Corporation outstanding mentor awards (2004 and 2010), and the best paper awards at the Design Automation Conference (2004) and the International Conference on Application of Concurrency to System Design (2009).

Sachin S. Sapatnekar (Fellow, IEEE) received the B.Tech. degree from the Indian Institute of Technology, Bombay, India, the M.S. degree from Syracuse University, Syracuse, NY, USA, and the Ph.D. degree from the University of Illinois, Urbana-Champaign, IL, USA.





Dr. Sapatnekar has received six conference Best Paper awards, a Best Poster Award, an ICCAD 10-year Retrospective Most Influential Paper Award, the SRC Technical Excellence award, and the SIA University Researcher Award.